Enhanced face-detection and face-tracking for embedded vision systems

ABSTRACT

Embodiments described herein provide various examples of a face-detection system. In one aspect, a process for performing image detections on grayscale images is disclosed. This process can begin by receiving a training image dataset, wherein the training image dataset includes a first subset of color images. The process then converts each image in the first subset of color images in the training image dataset into a grayscale image to obtain a first subset of converted grayscale images. Next, the process trains an image-detection statistical model using the training image dataset including the first subset of converted grayscale images. The process next receives a set of grayscale input images. The process subsequently performs image detections on the set of grayscale input images using the trained image-detection statistical model. Note that performing image detections on grayscale input images using an image-detection model trained on grayscale training images improves image detection accuracy over using an image-detection model trained on color training images.

PRIORITY CLAIM AND RELATED PATENT APPLICATIONS

This patent application is a continuation of, and hereby claims the benefit of priority under 35 U.S.C. § 120 to, co-pending U.S. patent application Ser. No. 15/943,728, filed on 3 Apr. 2018 (Attorney Docket No. AVS005.US02CIP), entitled "Enhanced Face-Detection and Face-Tracking For Resource-Limited Embedded Vision Systems," which in turn is a continuation-in-part of, and claims the benefit of priority under 35 U.S.C. § 120 to, U.S. patent application Ser. No. 15/796,798, filed on 28 Oct. 2017 (Attorney Docket No. AVS005.US01), entitled "Method and Apparatus for Real-time Face-Tracking and Face-Pose-Selection on Embedded Vision Systems." All of the above-listed applications are incorporated herein by reference as a part of this patent document.

TECHNICAL FIELD

The present disclosure generally relates to the field of machine learning and artificial intelligence, and more specifically to systems, devices and techniques for performing real-time face-detection, face-tracking and duplicated-face-detection on digital images captured on resource-limited embedded vision systems.

BACKGROUND

Deep learning (DL) is a branch of machine learning and artificial neural networks based on a set of algorithms that attempt to model high-level abstractions in data by using a deep graph with multiple processing layers. A typical DL architecture can include many layers of neurons and millions of parameters. These parameters can be trained from large amounts of data on fast GPU-equipped computers, guided by novel training techniques that can work with many layers, such as rectified linear units (ReLU), dropout, data augmentation, and stochastic gradient descent (SGD).

Among the existing DL architectures, the convolutional neural network (CNN) is one of the most popular. Although the idea behind CNN has been known for more than 20 years, the true power of CNN has only been recognized after the recent development of deep learning theory. To date, CNN has achieved numerous successes in many artificial intelligence and machine learning applications, such as face recognition, image classification, image caption generation, visual question answering, and self-driving cars.

Face detection, i.e., detecting and locating the position of each face in an image, is usually the first step in many face recognition applications. A modern face detection system often includes two main modules: a face detection module and a face tracking module. A face detection module often employs a DL architecture such as CNN to detect human faces in digital images. Once a new face (i.e., a new person) is detected by the face detection module in an image frame of a video, the face tracking module tracks the new person through subsequent image frames in the video to find/re-identify the same person in each of the subsequent image frames. For some low-complexity embedded system applications, the face tracking module can be implemented based on some simple tracking techniques, e.g., the Kalman filter and the Hungarian algorithm.

In many face detection applications, it is also desirable to perform face-pose estimation because each person's head/face can have different orientations, i.e., different poses in different images. Moreover, to avoid sending and storing too many faces of the same person, it is also desirable to keep track of the pose change of each face, and send just the face image corresponding to the "best pose," e.g., the face that is the closest to the frontal view (i.e., with the smallest rotations) of each detected person. Various techniques can be used to estimate the pose of the person's head/face. One example technique is to first estimate the locations of some facial landmarks, such as eyes, nose, and mouth, and then estimate the pose based on these landmark locations. Another technique involves representing the head pose with three Euler angles, i.e., yaw, pitch and roll, and estimating the pose directly with these three angles. Hence, when a tracked person/face is lost by the face tracking module, and the corresponding face tracker needs to be destroyed, a face image of that face having the best pose can be transmitted and stored for future reference. Using a CNN-based DL architecture, face detection and face-pose-estimation can be performed as a joint process.

A large number of face detection techniques can easily detect near-frontal faces. However, robust and fast face detection in uncontrolled situations can still be a challenging problem, because such situations are often associated with significant amounts of variations of faces, including pose changes, occlusions, exaggerated expressions, and extreme illumination variations. Some effective CNN-based face detection techniques that can manage such uncontrolled situations include (1) a cascaded-CNN framework described in Li et al., "A Convolutional Neural Network Cascade for Face Detection," Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Jun. 1, 2015 (referred to as "the cascaded CNN" or "the cascaded CNN framework" hereinafter), and (2) a multitask-cascaded-CNN framework described in Zhang et al., "Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks," IEEE Signal Processing Letters, Vol. 23, No. 10, pp. 1499-1503, October 2016 (referred to as "the MTCNN" or "the MTCNN framework" hereinafter).

However, due to the high complexity involved in the MTCNN framework and the often limited computational resources available within an embedded system, many challenges exist in implementing MTCNN-based face detection in an embedded system to achieve satisfactory real-time performance. Moreover, the simple face tracking techniques used by embedded systems often result in many near-duplicate faces being tracked and transmitted, thereby wasting computational resources and network bandwidth.

SUMMARY

Embodiments described herein provide various examples of a real-time face-detection, face-tracking, and face-pose-selection subsystem within an embedded vision system. In one aspect, a process for performing image detections on grayscale images is disclosed. This process can begin by receiving a training image dataset, wherein the training image dataset includes a first subset of color images. The process then converts each image in the first subset of color images in the training image dataset into a grayscale image to obtain a first subset of converted grayscale images. Next, the process trains an image-detection statistical model using the training image dataset including the first subset of converted grayscale images. The process next receives a set of grayscale input images. The process subsequently performs image detections on the set of grayscale input images using the trained image-detection statistical model.

In some embodiments, the training image dataset is a large-scale public training dataset composed of primarily color images.

In some embodiments, the set of grayscale input images are captured under a monochrome or a grayscale illumination condition.

In some embodiments, the monochrome or grayscale illumination condition includes LED lighting.

In some embodiments, the set of grayscale input images are captured by a camera configured to capture only grayscale images.

In some embodiments, the training image dataset further includes a second subset of grayscale images, and training the image-detection statistical model using the training image dataset includes using both the first subset of converted grayscale images and the second subset of grayscale images.

In some embodiments, the training image dataset is a face-image training dataset, and the image-detection statistical model is a face-detection statistical model.

In some embodiments, the face-detection statistical model includes a CNN face-detection module, and the CNN face-detection module further includes a multitask-cascaded-CNN (MTCNN).

In some embodiments, converting the training image dataset into grayscale images reduces data distribution skews between the training image dataset and the set of grayscale input images.

In some embodiments, performing image detections on the set of grayscale input images using the image-detection statistical model trained on the grayscale images improves image detection accuracy over using an image-detection statistical model trained on color training images.

In another aspect, a system for performing image detections on grayscale images is disclosed. This system includes: one or more processors; and a memory coupled to the one or more processors. The memory stores instructions that, when executed by the one or more processors, cause the system to: receive a training image dataset, wherein the training image dataset includes a first subset of color images; convert each image in the first subset of color images in the training image dataset into a grayscale image to obtain a first subset of converted grayscale images; train an image-detection statistical model using the training image dataset including the first subset of converted grayscale images; receive a set of grayscale input images; and perform image detections on the set of grayscale input images using the trained image-detection statistical model. Note that converting the training image dataset into grayscale images reduces data distribution skews between the training image dataset and the set of grayscale input images. Moreover, performing image detections on the set of grayscale input images using the image-detection statistical model trained on the grayscale images improves image detection accuracy over using an image-detection statistical model trained on color training images.

BRIEF DESCRIPTION OF THE DRAWINGS

The structure and operation of the present disclosure will be understood from a review of the following detailed description and the accompanying drawings in which like reference numerals refer to like parts and in which:

FIG. 1 illustrates an exemplary embedded vision system which includes real-time face-detection, face-tracking, and face-pose-estimation functionalities in accordance with some embodiments described herein.

FIG. 2 shows a block diagram of an exemplary implementation of the face-detection-and-tracking subsystem within the embedded vision system of FIG. 1 in accordance with some embodiments described herein.

FIG. 3 presents a flowchart illustrating an exemplary process for performing real-time face-detection, face-pose-estimation and face-tracking within an embedded vision system in accordance with some embodiments described herein.

FIG. 4 presents a flowchart illustrating an exemplary process for detecting if a tracked person has disappeared from the video in accordance with some embodiments described herein.

FIG. 5 illustrates a captured sequence of video frames and a corresponding subset of processed video frames in accordance with some embodiments described herein.

FIG. 6 presents a flowchart illustrating an exemplary process for performing face detection and tracking for unprocessed video frames based on the processed video frames in accordance with some embodiments described herein.

FIG. 7 shows a block diagram of an exemplary implementation of the face-detection-and-tracking subsystem within the embedded vision system of FIG. 1 in accordance with some embodiments described herein.

FIG. 8 presents a flowchart illustrating an exemplary process for improving face detection performance when processing grayscale input images in accordance with some embodiments described herein.

FIG. 9 presents a flowchart illustrating an exemplary process for identifying near-duplicate faces and selectively transmitting best-pose-faces to the server in accordance with some embodiments described herein.

FIG. 10 illustrates an example client-server network environment which provides for implementing the disclosed embedded vision system in accordance with some embodiments described herein.

DETAILED DESCRIPTION

The detailed description set forth below is intended as a description of various configurations of the subject technology and is not intended to represent the only configurations in which the subject technology may be practiced. The appended drawings are incorporated herein and constitute a part of the detailed description. The detailed description includes specific details for the purpose of providing a thorough understanding of the subject technology. However, the subject technology is not limited to the specific details set forth herein and may be practiced without these specific details. In some instances, structures and components are shown in block diagram form in order to avoid obscuring the concepts of the subject technology.

Throughout the specification, the following terms have the meanings provided herein, unless the context clearly dictates otherwise. The terms "head pose," "face pose," and "pose" are used interchangeably to mean the specific orientation of a person's head within an image. The terms "a tracked person," "a person being tracked," and "a face tracker" are used interchangeably to mean a person being detected and tracked in digital video images by the disclosed face-detection and face-tracking systems.

Embedded Vision Systems

FIG. 1 illustrates an exemplary embedded vision system 100 which includes real-time face-detection, face-pose-estimation, and face-tracking functionalities in accordance with some embodiments described herein. Embedded vision system 100 can be integrated with or implemented as a surveillance camera system, a machine vision system, a drone system, a robotic system, a self-driving vehicle, or a mobile device. As can be seen in FIG. 1, embedded vision system 100 can include a bus 102, a processor 104, a memory 106, a storage device 108, a camera subsystem 110, a face-detection-and-tracking subsystem 112, an output device interface 120, and a network interface 122. In some embodiments, embedded vision system 100 is a low-cost embedded system.

Bus 102 collectively represents all system, peripheral, and chipset buses that communicatively couple the various components of embedded vision system 100. For instance, bus 102 communicatively couples processor 104 with memory 106, storage device 108, camera system 110, face-detection-and-tracking subsystem 112, output device interface 120, and network interface 122.

From memory 106, processor 104 retrieves instructions to execute and data to process in order to control various components of embedded vision system 100. Processor 104 can include any type of processor, including, but not limited to, a microprocessor, a mainframe computer, a digital signal processor (DSP), a personal organizer, a device controller and a computational engine within an appliance, and any other processor now known or later developed. Furthermore, processor 104 can include one or more cores. Processor 104 itself can include a cache that stores code and data for execution by processor 104.

Memory 106 can include any type of memory that can store code and data for execution by processor 104. This includes, but is not limited to, dynamic random access memory (DRAM), static random access memory (SRAM), flash memory, read-only memory (ROM), and any other type of memory now known or later developed.

Storage device 108 can include any type of non-volatile storage device that can be integrated with embedded vision system 100. This includes, but is not limited to, magnetic, optical, and magneto-optical storage devices, as well as storage devices based on flash memory and/or battery-backed-up memory.

Bus 102 is also coupled to camera subsystem 110. Camera subsystem 110 is configured to capture still images and/or video images at predetermined resolutions and couple the captured image or video data to various components within embedded vision system 100 via bus 102, such as to memory 106 for buffering and to face-detection-and-tracking subsystem 112 for face-detection, face-pose-estimation, face-tracking, and best-pose-selection. Camera subsystem 110 can include one or more digital cameras. In some embodiments, camera subsystem 110 includes one or more digital cameras equipped with wide-angle lenses. The images or videos captured by camera subsystem 110 can have different resolutions, including high resolutions such as 1280×720p, 1920×1080p or other high resolutions.

Face-detection-and-tracking subsystem 112 further includes a face tracking module 114, a joint face-detection and face-pose-estimation module 116, and a best-pose-selection module 118. In some embodiments, face-detection-and-tracking subsystem 112 is configured to receive the captured video images, such as captured high-resolution video images via bus 102, and perform CNN-based face-detection and face-pose-estimation operations on the received video images using joint face-detection and face-pose-estimation module 116 to detect faces within each video image and generate face-pose-estimations for each detected face. Face-detection-and-tracking subsystem 112 is further configured to track each uniquely detected face through a sequence of video images using face tracking module 114, and determine the best pose for each tracked face using best-pose-selection module 118 when a tracked face is determined to be lost. Joint face-detection and face-pose-estimation module 116 can be implemented with one or more hardware CNN modules. Generally, face-detection-and-tracking subsystem 112 is configured to track multiple people at the same time. Ideally, face-detection-and-tracking subsystem 112 should be configured to simultaneously track as many persons as possible. In some embodiments, joint face-detection and face-pose-estimation module 116 can be implemented with a coarse-to-fine multi-stage MTCNN architecture. If embedded vision system 100 is a low-cost embedded system, joint face-detection and face-pose-estimation module 116 can be implemented with one or more low-cost hardware CNN modules, such as the built-in CNN module within the HiSilicon Hi3519 system-on-chip (SoC).

Output device interface 120, which is also coupled to bus 102, enables, for example, the display of the results generated by face-detection-and-tracking subsystem 112. Output devices used with output device interface 120 include, for example, printers and display devices, such as cathode ray tube displays (CRT), light-emitting diode displays (LED), liquid crystal displays (LCD), organic light-emitting diode displays (OLED), plasma displays, or electronic paper.

Finally, as shown in FIG. 1, bus 102 also couples embedded vision system 100 to a network (not shown) through a network interface 122. In this manner, embedded vision system 100 can be a part of a network (such as a local area network ("LAN"), a wide area network ("WAN"), or an intranet), or a network of networks, such as the Internet. In some embodiments, face-detection-and-tracking subsystem 112 is configured to send a detected face having the best pose among multiple detected faces of a given person to a control center or a main server through network interface 122 and the network. Any or all components of embedded vision system 100 can be used in conjunction with the subject disclosure.

The Proposed Face-Detection-and-Tracking Architecture

FIG. 2 shows a block diagram of an exemplary implementation 200 of face-detection-and-tracking subsystem 112 within embedded vision system 100 in accordance with some embodiments described herein. As shown in FIG. 2, face-detection-and-tracking subsystem 200 receives a sequence of video images of a captured video 202 as input and generates the best-pose face image 220 for each uniquely detected face/person from video 202 as output. Note that face-detection-and-tracking subsystem 200 includes at least a motion detection module 204, a face detection module 206, a face-pose-estimation module 208, a best-pose-selection module 210, and a face tracking module 212. As mentioned above, face-detection-and-tracking subsystem 200 is configured to track multiple people within captured video 202 at the same time. Ideally, face-detection-and-tracking subsystem 200 is configured to simultaneously track as many people as possible. Face-detection-and-tracking subsystem 200 can also include additional modules not shown in FIG. 2. We now describe each of the blocks in face-detection-and-tracking subsystem 200 in more detail.

Motion Detection

As can be seen, a given video image of captured video 202 is first received by motion detection module 204. In some embodiments, it is assumed that a human face captured in video image 202 is associated with a motion, which begins when a person first enters the field of view of the camera and ends when that same person exits the field of view of the camera or is obscured by another person or an object. Hence, to reduce the computational complexity of face-detection-and-tracking subsystem 200, motion detection module 204 can be used to preprocess each video frame to locate and identify those areas within each video frame which are associated with motions. In this manner, face detection module 206 only needs to operate on those detected moving areas to detect human faces, whereas the remaining areas within the video image which are not associated with a motion can be ignored (i.e., not further processed by face-detection-and-tracking subsystem 200), thereby increasing overall system efficiency and image processing speed. However, there are scenarios where a person enters the video and then stops moving. In such cases, an initially moving face becomes a still face. We provide some techniques below which are capable of detecting and handling such still faces.

In some embodiments, motion detection module 204 detects moving areas within a newly received video image by directly computing a difference image between the current video image and a previous video image in a sequence of video frames. In one embodiment, the current video image is compared with the immediately preceding video image in the sequence of video frames. In some embedded systems, motion detection module 204 can be implemented with built-in motion detection hardware, such as a DSP. For example, when face-detection-and-tracking subsystem 200 is implemented using the Hi3519 SoC, motion detection module 204 can be implemented with the built-in motion detection function implemented on a DSP within the Hi3519 SoC. The output from motion detection module 204 includes a set of identified moving areas 214 which can have many different sizes. Note that, for a detected moving human object, the associated moving area can include both the human face and the human body, and can also include more than one face. Hence, each identified moving area is then sent to the subsequent face detection module 206 to detect most or all faces within each detected moving area.
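For illustration only, the minimal sketch below shows one way such frame-differencing motion detection could be realized in software using OpenCV; the function name, threshold, and minimum blob area are hypothetical and are not part of the disclosed motion detection module 204.

```python
import cv2

def detect_moving_areas(prev_frame, curr_frame, diff_threshold=25, min_area=500):
    """Return bounding boxes (x, y, w, h) of image areas that changed between two frames."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    # Difference image between the current frame and the previous frame
    diff = cv2.absdiff(curr_gray, prev_gray)
    # Threshold and dilate so nearby changed pixels merge into contiguous blobs
    _, mask = cv2.threshold(diff, diff_threshold, 255, cv2.THRESH_BINARY)
    mask = cv2.dilate(mask, None, iterations=2)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # Keep only blobs large enough to plausibly contain a face or a person
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) >= min_area]
```

Each returned box could then be passed to the face detection module as one identified moving area.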

More detail of motion detection module 204 is described in U.S. patent application Ser. No. 15/789,957, filed on 20 Oct. 2017 and entitled "Joint Face-Detection and Head-Pose-Angle-Estimation Using Small-scale Convolutional Neural Network (CNN) Modules for Embedded Systems," (Attorney Docket No. AVS002.US02CIP), the content of which is incorporated herein by reference.

Face-Detection and Face-Pose-Estimation

For each of the detected moving areas 214 generated by motion detection module 204, a CNN-based face detection module 206 can be used to detect some or all faces within the detected moving area. Many different techniques can be used to implement face detection module 206. For example, a histogram of oriented gradients (HoG)-feature comparison technique in conjunction with a support vector machine (SVM) classifier can be used to implement face detection module 206. In some embodiments, face detection module 206 can be implemented with a coarse-to-fine multi-stage CNN architecture described in Zhang et al., "Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks," IEEE Signal Processing Letters, Vol. 23, No. 10, pp. 1499-1503, October 2016. Other than using the MTCNN architecture, face detection module 206 can be implemented with other known or later developed CNN-based face-detection architectures and techniques without departing from the scope of the described technology. Face detection module 206 generates a set of detected faces 216 and the corresponding bounding box locations. Note that face tracking module 212 can be used to track previously detected faces of processed video images based on the current output of face detection module 206 associated with a newly processed video image.

When a person is moving in a video, the person's head/face can have different orientations, i.e., different poses in different video images. Estimating the pose of each detected face allows for keeping track of the pose change of each face through the sequence of video frames, and, when a face tracker is considered lost and needs to be removed, sending just the face image corresponding to the "best pose," i.e., the face image that is the closest to the frontal view (i.e., having the smallest rotations) of each detected person, to the main server for face recognition. Face-detection-and-tracking subsystem 200 uses a face-pose-estimation module 208 to estimate the pose of each detected face from face detection module 206 and generate face-pose estimations 218. The outputs from face-pose-estimation module 208 can be used by best-pose-selection module 210 to update the best pose for each tracked person as that person moves through the sequence of video frames.

In one technique, face pose is estimated based on the locations of some facial landmarks, such as eyes, nose, and mouth, e.g., by computing distances of these facial landmarks from the frontal view. Another technique for face pose estimation involves representing the face pose with three Euler angles, i.e., yaw, pitch and roll, and estimating the pose directly with these three angles. The angle-based pose estimation approach typically has a lower complexity than the landmark-based approach because the angle-based approach requires just three values, whereas the landmark-based approach generally requires more than three landmark coordinates in its estimation. Moreover, the angle-based pose estimation approach also facilitates performing a simple best-pose estimation by using the sum of absolute values of the three pose angles.
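As an illustration of this angle-based criterion, the following minimal sketch (the function names and the tracker dictionary are hypothetical, not part of best-pose-selection module 210) keeps, for each tracked person, the face crop whose summed absolute pose angles are smallest:

```python
def pose_metric(yaw, pitch, roll):
    """Sum of absolute Euler angles; zero corresponds to a perfect frontal view."""
    return abs(yaw) + abs(pitch) + abs(roll)

def update_best_pose(tracker, face_image, yaw, pitch, roll):
    """Keep the face crop with the smallest pose metric seen so far for this tracker."""
    metric = pose_metric(yaw, pitch, roll)
    if tracker.get("best_metric") is None or metric < tracker["best_metric"]:
        tracker["best_metric"] = metric
        tracker["best_face"] = face_image
```

A tracker here is simply a per-person dictionary that is updated every time a new pose estimate for that person becomes available.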

Both of the above-described face-pose-estimation techniques can be implemented with traditional methods without using a deep neural network, or with a deep neural network such as a CNN. When implemented with a CNN, face detection module 206 and face-pose-estimation module 208 can be jointly implemented as a single neural network. A CNN-based joint face-detection and face-pose-estimation system and technique has been described in U.S. patent application Ser. No. 15/789,957, filed on 20 Oct. 2017, entitled "Joint Face-Detection and Head-Pose-Angle-Estimation Using Small-scale Convolutional Neural Network (CNN) Modules for Embedded Systems," (Attorney Docket No. AVS002.US02CIP), the content of which is incorporated herein by reference.

In face-detection-and-tracking subsystem 200, face-pose-estimation module 208 is followed by best-pose-selection module 210, which is configured to determine and update the "best pose" for each tracked person from a sequence of pose estimations associated with a sequence of detected faces of the tracked person in a sequence of video frames. In some embodiments, the best pose is defined as a face pose closest to the frontal view (i.e., with the smallest overall head rotations). As can be seen in FIG. 2, best-pose-selection module 210 can be coupled to face tracking module 212 to receive face tracking information. Hence, best-pose-selection module 210 can keep track of each tracked person as the pose of this person is continuously estimated at face-pose-estimation module 208 and the best pose of this person is continuously updated at best-pose-selection module 210. In some embodiments, when a tracked person is determined to have disappeared from the video by face tracking module 212 (i.e., the associated face tracker is determined to be lost), best-pose-selection module 210 is configured to transmit the detected face image corresponding to the current best pose (i.e., best-pose face image 220) of the tracked person to the control center or the main server for face recognition tasks.

A number of techniques can be implemented on face tracking module 212 to determine if a tracked person has disappeared from the video. For example, one technique keeps track of the most recently computed face bounding box for each person. Assuming there is no extreme movement of a tracked person, an overlap of some degree is expected between the most recently computed bounding box of the tracked person and the immediately preceding bounding box of this tracked person. Hence, when face tracking module 212 determines that there is no overlap between the most recently computed bounding box of the tracked person and the immediately preceding bounding box of this tracked person, face tracking module 212 determines that the tracked person has disappeared from the video; a minimal sketch of this overlap test is shown below. As another technique, face tracking module 212 can keep track of the assigned labels for all of the detected faces (i.e., the face associations between a set of unique labels and the corresponding set of detected faces). Next, if a given label previously assigned to a tracked person in a previously processed video image is not assigned to any detected face in the currently processed video image, the tracked person associated with the given label can be considered lost.
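The following is a minimal sketch of the overlap-based test, assuming axis-aligned (x, y, w, h) bounding boxes; the helper names are illustrative and not part of face tracking module 212.

```python
def boxes_overlap(box_a, box_b):
    """Boxes are (x, y, w, h); returns True if the two boxes share any area."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def tracker_lost(prev_box, curr_box):
    """A tracked person is considered lost when no face was found in the current frame,
    or when the newest bounding box no longer overlaps the previous one."""
    return curr_box is None or not boxes_overlap(prev_box, curr_box)
```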

In some scenarios, a person remaining in the video for a very long time can cause a long delay before the best pose of that person can be transmitted to the control center or the main server. To mitigate this problem, some embodiments provide an early pose submission technique. In a particular embodiment of the early pose submission technique, during the processing of a sequence of video images, if an estimated face pose of a tracked person is determined to be sufficiently good in any video frame, for example when compared with a threshold value, the detected face image corresponding to the "good-enough pose" can be immediately transmitted to the server without waiting for the tracked person to leave the video. More specifically, if facial landmark points are generated by the neural network alongside the detected face image, the distance of the determined landmarks with respect to reference facial landmarks associated with a full-frontal face pose can be compared with a threshold distance. Alternatively, if pose angles are generated by the neural network alongside the detected face image, the sum of absolute values of the estimated pose angles can be compared with a threshold angle. In both cases, when the newly computed pose metric is below the corresponding threshold value, the newly computed pose metric of the tracked person can be considered "good enough" and the corresponding face image of the tracked person can then be transmitted to the server. In these embodiments, to avoid sending duplicated faces of the same person, after the tracked person is determined to have disappeared from the video, the determined best pose of that person is transmitted to the control center or the server only if no such "good enough" face image has already been submitted to the control center or the server.

Face Tracking

In face-detection-and-tracking subsystem 200, to find the best pose of each tracked person, it is necessary to track the location of the tracked person in each frame of the captured video, from the time the person is initially detected in the video until the time the person is determined to have disappeared from the video.

In some embodiments, the CNN-based face-detection and face-pose-estimation modules 206 and 208 are applied to every frame of the captured video 202. In other words, the input to motion detection module 204 in subsystem 200 includes every frame of the captured video 202. This is possible for a high-performance embedded vision system 100 or when the captured video frame rate is quite low. In these embodiments, face-detection-and-tracking subsystem 200 generates a set of detected faces and the corresponding bounding box coordinates for every frame of the captured video 202. Using the face detection information from the sequence of video frames, face tracking module 212 can perform either single-face tracking or multi-face tracking in a sequential manner. For example, if captured video 202 only includes a single person, face tracking based on the processed video frames simply involves determining when the tracked person is no longer in the video.

If captured video 202 includes multiple people, face tracking module 212 needs to be configured to perform multi-face tracking to simultaneously track the multiple people. In one embodiment, multiple people can be initially detected within a single video frame. In another embodiment, multiple people can be separately detected within multiple video frames as the multiple people enter the video at different times. In some embodiments, after multiple people are detected and labeled, face tracking module 212 performs multi-face tracking to simultaneously track the detected multiple people by matching a set of labeled bounding boxes in a previous video frame to the identified bounding boxes in a newly processed video frame. In one embodiment, a Hungarian algorithm can be used to associate labeled faces in the previous video frame with the identified bounding boxes in the newly processed video frame. For example, a similarity matrix between the bounding boxes in the previous video frame and the bounding boxes in the newly processed video frame can be constructed, wherein each matrix element measures a similarity score between a given bounding box in the previous video frame and a bounding box in the newly processed video frame. The similarity score can be computed using different metrics, one of them being the intersection-over-union between a pair of bounding boxes. For associating bounding boxes in two consecutive video frames, other data association techniques can be used in place of the Hungarian algorithm, and the generated CNN features used for data association can be other than the bounding boxes. For example, to improve the face association performance, some low-cost face features can be considered, such as the face size, aspect ratio, LBP, HOG, and histogram of color, among others.
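A minimal sketch of this association step is shown below, assuming (x, y, w, h) boxes and using SciPy's linear_sum_assignment as the Hungarian solver with 1 - IoU as the matching cost; the function names and the IoU threshold are illustrative, not the subsystem's actual implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def associate_faces(prev_boxes, curr_boxes, iou_threshold=0.3):
    """Match labeled boxes from the previous frame to boxes in the current frame.
    Returns (prev_index, curr_index) pairs; unmatched current boxes are new people."""
    cost = np.array([[1.0 - iou(p, c) for c in curr_boxes] for p in prev_boxes])
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm (minimizes total cost)
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_threshold]
```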

FIG. 3 presents a flowchart illustrating an exemplary process 300 for performing real-time face-detection, face-pose-estimation and face-tracking within an embedded vision system in accordance with some embodiments described herein. The process 300 begins by receiving a video image among a sequence of video frames of a captured video (step 302). In some embodiments, the embedded vision system includes a surveillance camera system, a machine vision system, a self-driving car, or a mobile phone. Next, process 300 performs a face detection operation on the video image to detect a set of faces in the video image (step 304). In some embodiments, performing the face detection operation on the video image to detect a set of faces in the video image includes identifying a set of moving areas within the video image using the above-described motion detection module and, for each of the set of identified moving areas, applying a CNN-based face detection technique to detect if the moving area includes a human face. In some embodiments, each of the detected face images is defined by a bounding box within the original video image.

Next, process 300 determines that a new person has appeared in the video based on the detected faces (step 306). For example, process 300 can perform a face association operation between a set of labeled detected faces in an immediately preceding video image and the set of unlabeled bounding boxes in the current video image. The process subsequently identifies each of the detected faces not associated with a previously detected face as a new person. Next, process 300 tracks the new person through subsequent video images in the captured video (step 308). For example, process 300 can detect a sequence of new locations of the new person in the subsequent video images. For each of the subsequent video images which contains a new location of the new person, process 300 can compute a face pose for the detected face of the new person at the new location and update the best pose for the new person.

Next, process 300 detects if the tracked new person has disappeared from the video (step 310). FIG. 4 presents a flowchart illustrating an exemplary process 400 for detecting if a tracked person has disappeared from the video in accordance with some embodiments described herein. The process begins by determining that the tracked person does not have a corresponding detected face in a current video frame (step 402). In some embodiments, determining that the tracked person does not have a corresponding detected face in the current video frame includes failing to detect the tracked person at and around a predicted new location in the current video frame. Next, the process detects if the tracked person has a corresponding face at the same location in the current video frame as the location of the detected face of that tracked person in the video frame preceding the current video frame (step 404). If so, the process determines that the tracked person is not moving and has become a still face (step 406). Otherwise, the process determines that the tracked person has indeed disappeared from the video (step 408).

Returning to FIG. 3, if it is determined at step 310 that the tracked new person has disappeared from the video, process 300 subsequently transmits a detected face of the tracked new person corresponding to the determined best pose to a server (step 312). Note that transmitting just the face image corresponding to the best pose without sending all of the detected faces significantly reduces network bandwidth and storage space. Otherwise, if the tracked new person remains in the video, process 300 continues to track this person through the subsequent video images and update the best pose for this person (step 314).

The above discussion has assumed that the CNN-based face-detection and face-pose-estimation modules 206 and 208 can be applied to every video frame of captured video 202. However, in some embodiments of the embedded vision system 100, processing every video frame with CNN modules is simply not practical due to the resource and performance limitations of such embedded systems. Moreover, to reduce computational complexity and increase the real-time video processing speed, the CNN-based face-detection and face-pose-estimation modules 206 and 208 do not have to be applied to every video frame. In some embodiments, motion detection module 204 only receives a subset of the captured video frames, e.g., one in every N video frames, and as such, CNN-based face-detection and face-pose-estimation modules 206 and 208 are only applied to the subset of video frames. FIG. 5 illustrates a captured sequence of video frames 502 and a corresponding subset of processed video frames 504 in accordance with some embodiments described herein. In the embodiment shown, only one in every 4 video frames is processed (i.e., N=4) for face detection and face-pose-estimation. Hence, those video frames in between processed video frames do not have tracking information associated with them.

However, it may be necessary to perform face detection and face tracking for those "in-between" video frames which were not processed by the CNN modules. For example, in some applications where the bounding boxes of a tracked face are continuously displayed on a monitor, it is desirable to generate bounding boxes for those in-between frames because otherwise the display of the bounding boxes will flicker constantly. Moreover, when N is large, it becomes more difficult to track multiple people from one processed video frame 504 to the next processed video frame 504 using the face association technique, because a substantial amount of movement might have occurred for multiple tracked people. This situation again requires that face tracking be applied to those in-between video frames.

In some embodiments, face tracking module 212 can be configured to locate and label the tracked faces within the in-between video frames without applying face detection module 206. In some embodiments, face tracking module 212 is configured to determine the location of each tracked face within an unprocessed video frame (e.g., Frame 2) immediately following a processed frame 504 (e.g., Frame 1) based on the determined location of the tracked face in the processed frame (e.g., Frame 1). For example, FIG. 6 presents a flowchart illustrating an exemplary process 600 for performing face detection and tracking for unprocessed video frames based on the processed video frames in accordance with some embodiments described herein. In some embodiments, exemplary process 600 is implemented on face tracking module 212.

For each detected face in a processed video frame, process 600 locates the corresponding bounding box of the detected face as a reference box and the detected face image within the bounding box as a search block (step 602). Next, in a subsequent unprocessed video frame, process 600 uses the search block to search within a search window of a predetermined size centered around the same location as the reference box in the unprocessed video frame (step 604). More specifically, within the search window, multiple locations (e.g., 64 different locations) of the size of the reference box can be searched. At each of the search locations within the search window, the search block is compared with the image patch within a box of the reference-box size at that location (step 606). Hence, process 600 identifies the same detected face in the unprocessed video frame at the search location where the best match between the search block and the corresponding image patch is found (step 608). Note that as long as the search window in the unprocessed video frame is sufficiently large and the location of the detected face does not change too much between two consecutive frames (i.e., assuming no extreme movement), this direct search technique is able to accurately locate the detected face in the unprocessed video frame without using a neural network, and regardless of whether the movement of the detected face is linear or nonlinear. Next, process 600 can place a corresponding bounding box at the search location where the best match is identified to indicate the new location of the detected face in the unprocessed video frame (step 610). Note that after applying process 600 to a given unprocessed video frame (e.g., Frame 2 in FIG. 5), process 600 can be repeated for other unprocessed video frames (e.g., Frames 3-4 in FIG. 5) immediately following the newly processed video frame based on the detected faces within the newly processed video frame (e.g., Frame 2).

In one embodiment, process 600 can compare the search block against the image patch at a given search location within the search window by computing a similarity score between the search block and the compared image patch. In some embodiments, the similarity between the search block and the image patch can be simply computed as the difference between the search block and the image patch.
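For illustration only, a minimal NumPy sketch of this exhaustive search follows, using the sum of absolute pixel differences as the (dis)similarity score; the function name, search radius, and step size are hypothetical and not part of process 600.

```python
import numpy as np

def track_face_by_search(ref_frame, new_frame, ref_box, search_radius=16, step=4):
    """Locate the face defined by ref_box (x, y, w, h) in ref_frame at its new position
    in new_frame by exhaustively comparing patches in a search window around ref_box."""
    x, y, w, h = ref_box
    search_block = ref_frame[y:y + h, x:x + w].astype(np.int32)
    best_score, best_box = None, ref_box
    for dy in range(-search_radius, search_radius + 1, step):
        for dx in range(-search_radius, search_radius + 1, step):
            nx, ny = x + dx, y + dy
            if nx < 0 or ny < 0:
                continue  # candidate box would fall outside the top or left edge
            patch = new_frame[ny:ny + h, nx:nx + w]
            if patch.shape[:2] != (h, w):
                continue  # candidate box extends past the right or bottom edge
            # Sum of absolute differences: smaller means a better match
            score = np.abs(patch.astype(np.int32) - search_block).sum()
            if best_score is None or score < best_score:
                best_score, best_box = score, (nx, ny, w, h)
    return best_box
```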

In some embodiments, to speed up the above-described search process, face tracking module 212 is configured to predict the new location of each detected face in an unprocessed video frame based on the reference locations in the processed video image and predicted motions of the detected faces. More specifically, for each detected face in a processed video frame (e.g., Frame #5 in FIG. 5), face tracking module 212 first makes a prediction of the estimated location of the detected face in the unprocessed video frame (e.g., Frame #6 in FIG. 5). In these embodiments, a movement (e.g., the trajectory and speed) of the detected face is first predicted based on multiple face locations of the detected face in the previously processed video frames. For example, in the illustrated video sequence 502 in FIG. 5, the detected face locations in Frames 1 and 5 can be used to predict the new locations of the detected face in Frames 6-8. Note that the prediction of the movement can include either a linear prediction or a non-linear prediction. In linear prediction, the trajectory and speed of the movement can be predicted. In non-linear prediction, a Kalman filter approach can be applied.

Next, for each detected face in the processed video frame, face tracking module 212 uses a corresponding search block to search around the estimated new location (i.e., the search location) of the detected face in the unprocessed video frame. Note that due to the improved accuracy of the estimated location, face tracking module 212 does not need to search many positions around the estimated location. At each of the search locations centered around the estimated location, the search block is compared with the image patch within the search box. Hence, the same detected face can be identified in the unprocessed video frame at the search location where the best match between the search block and the corresponding image patch is found. As a variation to the above process, face tracking module 212 can directly apply the bounding box of the detected face from the previous frame to the estimated location in the unprocessed video frame and use the image patch within the bounding box at the estimated location as the detected face in the unprocessed video frame.

To further reduce the computational complexity and speed up the face-tracking process, motion detection by motion detection module 204 can be performed on a downsampled/low-resolution version of the sequence of frames, which can be obtained through many standard face detection schemes. One of the approaches for generating downsampled versions of the input video frames and performing face detection within such images was described in U.S. patent application Ser. No. 15/657,109, filed on 21 July, entitled "Face Detection Using Small-scale Convolutional Neural Network (CNN) Modules for Embedded Systems," (Attorney Docket No. AVS002.US01), the content of which is incorporated herein by reference.

Face-Tracking Under Low Frame Processing Rate

As mentioned above, applying face-detection and face-pose-estimation modules 206 and 208 to every input video frame can be computationally intensive due to the multi-stage CNN operations. For some embedded systems which capture videos at a high frame rate, performing real-time face-detection and face-pose-estimation on the captured video images becomes quite challenging. Some low-end embedded video systems may not be able to process the input video frames as fast as the new video frames are captured if the processing speed lags behind the high input frame rate. In such cases, some embedded systems are only able to perform CNN-based face detection and face-pose-estimation on a subset of video frames, e.g., one in every N input video frames (e.g., N=4). As a result, no face detection and face-pose estimation are made for these unprocessed or "in-between" video frames. In such systems, face tracking performance can be quite poor and a tracked face can be easily lost.

To mitigate this problem, one approach is to use the determined face locations in the last two or more processed video frames to predict the face locations for the unprocessed frames immediately following the last one of the two or more processed frames used for making such predictions. For example, in the illustrated example of FIG. 5, processed Frames 1 and 5 can be used to predict face locations for Frames 6-8. As another example, processed Frames 1, 5, and 9 can be used to predict face locations for Frames 10-12. In some embodiments, face location predictions for the unprocessed frames can include using a linear prediction or a more complex scheme such as a Kalman filter approach.
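A minimal sketch of the linear (constant-velocity) variant of this prediction follows, assuming (x, y, w, h) boxes; the function name, its arguments, and the example coordinates are illustrative only.

```python
def predict_locations(box_prev, box_curr, frame_gap, num_predicted):
    """Linearly extrapolate bounding boxes for the unprocessed frames that follow the most
    recent processed frame, assuming roughly constant velocity between the two detections.
    box_prev / box_curr are (x, y, w, h) boxes from two processed frames frame_gap apart."""
    vx = (box_curr[0] - box_prev[0]) / frame_gap  # per-frame horizontal velocity
    vy = (box_curr[1] - box_prev[1]) / frame_gap  # per-frame vertical velocity
    return [(box_curr[0] + vx * k, box_curr[1] + vy * k, box_curr[2], box_curr[3])
            for k in range(1, num_predicted + 1)]

# Example matching FIG. 5: boxes from processed Frames 1 and 5 used to predict Frames 6-8
predicted = predict_locations((100, 80, 40, 40), (120, 80, 40, 40),
                              frame_gap=4, num_predicted=3)
```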

Another approach to mitigate the low-frame-processing-rate problem involves using motion estimation to search in one or more subsequent unprocessed video frames for the new location of each detected face in the previously processed frame. Again using FIG. 5 as an example, assume CNN-based face-detection and face-pose-estimation modules 206 and 208 have been applied to Frame 1 and Frame 5. Due to the large gap between the two processed frames, it can be difficult to directly associate labels in Frame 1 with the detected faces in Frame 5. In some embodiments, the search and label process described above in conjunction with FIG. 6 can be recursively applied from Frame 1 to Frame 5. More specifically, using the detected and labeled faces in Frame 1, corresponding faces in Frame 2 are first searched and the reappearing faces in Frame 2 are subsequently labeled. Next, the detected and labeled faces in Frame 2 are used as references to search and label faces in Frame 3, and so on. Eventually, the faces in Frame 4 are also searched and labeled based on the original face detection information from Frame 1. Next, the labeled faces in Frame 4 can be used to label previously detected faces in Frame 5 using one of the standard face association techniques.

As a special case, the CNN-based face-detection and pose-estimation modules 206 and 208 can be applied to every other video frame, for example, Frames 1, 3, 5, 7, etc. (i.e., N=2). The CNN-processed frames can subsequently be labeled with one of the face association techniques, such as the intersection-over-union (IoU) technique as described above. Next, for each of the in-between unprocessed video frames (e.g., Frame 2), the location of each tracked face can be simply determined by interpolating the corresponding locations of the tracked face in the immediately preceding and the immediately following processed video frames (e.g., Frames 1 and 3). Note that the above technique can be easily extended to a scenario where the CNN-based face-detection and pose-estimation modules 206 and 208 are applied to one in every three video frames (i.e., N=3).
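As a simple illustration of this interpolation (assuming (x, y, w, h) boxes; the helper name and example coordinates are hypothetical), the box for an in-between frame can be computed as:

```python
def interpolate_box(box_prev, box_next, t=0.5):
    """Linearly interpolate a bounding box for an in-between frame from the boxes in the
    immediately preceding and following processed frames; t=0.5 gives the midpoint (N=2)."""
    return tuple((1 - t) * a + t * b for a, b in zip(box_prev, box_next))

# Example: the tracked face's box in Frame 2 estimated from its boxes in Frames 1 and 3
frame2_box = interpolate_box((100, 80, 40, 40), (112, 84, 40, 40))
```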

Detection of Still Faces

As mentioned above, if it is assumed that a detected person in a captured video is constantly moving and a motion detection operation is used to preprocess the video frames to extract and process only those moving areas, then when a person stops moving at some point in the video images, the face of that person cannot be detected in the subsequent video frames. Note that this problem does not exist if motion detection module 204 is removed from face-detection-and-tracking subsystem 200, and the entire input video image is processed by the subsequent modules. A technique for detecting a person who has stopped moving has been described above in conjunction with FIG. 4. This technique can also be used to continue monitoring the stopped person through more video frames as long as the person remains still and until this person starts moving again. Then, the above-described face tracking techniques can again be applied to the person in motion.

Displaying High Frame Rate

Many surveillance systems include a video previewing feature designed to allow the control center to preview the captured video of each camera in real time. However, because a low-cost CNN-based face detection module 206 implemented on embedded vision system 100 may not be able to run at a high frame rate (e.g., at a 30 frames per second (fps) video capturing rate), displaying only a subset of processed frames and the associated detected-face bounding boxes can have very poor visual quality, due to the significantly reduced frame rate.

In some embodiments, to improve the visual quality in the previewing mode, one disclosed technique is designed to generate a high-frame-rate display by introducing a delay in the display. Again using FIG. 5 as an example, note that the input video sequence is processed at every 4th frame. To display the captured video, the display starts with a delay of 4 frames, and the first processed Frame 1 is displayed followed by the next three unprocessed Frames 2-4. Next, the second processed Frame 5 is displayed followed by the next three unprocessed Frames 6-8, and the process goes on. Although not all displayed frames will show the detected face bounding boxes, the displayed video can be played at the original frame rate, and therefore can be as smooth as the original video. Note that the illustrated delay 506 of 4 frames is only used as an example. In general, delay 506 can be determined based on the processing time required, and set to be greater than the determined processing delay.

Improved Face-Detection and Face-Tracking Subsystem

FIG. 7 shows a block diagram of an exemplary implementation 700 of face-detection-and-tracking subsystem 112 within embedded vision system 100 in accordance with some embodiments described herein. Note that the overall structure of face-detection-and-tracking subsystem 700 is substantially identical to that of face-detection-and-tracking subsystem 200. That is, face-detection-and-tracking subsystem 700 receives a sequence of video images 702 as input and generates a set of best-pose face images 720 for a set of uniquely detected faces/persons from video images 702 as output. Again, face-detection-and-tracking subsystem 700 includes a motion detection module 704, a face detection module 706, a face-pose-estimation module 708, a best-pose-selection module 710, and a face tracking module 712. Face-detection-and-tracking subsystem 700 is configured to detect new faces and track multiple faces within captured video 702 at the same time. Ideally, face-detection-and-tracking subsystem 700 is configured to simultaneously track as many people as possible.

In some embodiments, face detection module 706 can be implemented with a DL-based MTCNN architecture described in Zhang et al., "Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks," IEEE Signal Processing Letters, Vol. 23, No. 10, pp. 1499-1503, October 2016. This MTCNN architecture integrates face detection and face alignment operations using unified cascaded CNNs through a multi-task learning process. Similar to the cascaded-CNN framework described in Li et al., "A Convolutional Neural Network Cascade for Face Detection," Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Jun. 1, 2015, the MTCNN also uses several coarse-to-fine CNN stages to operate on different resolutions of the input image. However, in the MTCNN architecture, facial landmark localization, binary face classification, and bounding box calibration are trained jointly using a single CNN in each stage. As a result, only three stages are needed in the MTCNN architecture. More specifically, the first stage of the MTCNN generates candidate facial windows quickly through a shallow CNN. Next, the second stage of the MTCNN refines the candidate windows by rejecting a large number of non-face windows through a more complex CNN. Finally, the third stage of the MTCNN uses a more powerful CNN to further decide whether each input window is a face or not. If a window is determined to be a face, the locations of five facial landmarks are also estimated. The MTCNN architecture is generally more suitable for implementation on resource-limited embedded vision systems compared to the cascaded-CNN framework. Other than using the MTCNN architecture, face detection module 706 can also be implemented with other known or later developed CNN-based face-detection architectures and techniques without departing from the scope of the described technology. Face detection module 706 generates a set of detected faces 716 and the corresponding bounding box locations.

Although the MTCNN architecture is generally applicable to most embedded vision systems, for many embedded vision systems with resource constraints (such as the Hi3519 SoC), the original MTCNN architecture can still consume a large amount of computational resources, limit the face-detection speed, and ultimately limit the number of people that can be simultaneously tracked by face detection module 706. In some embodiments, instead of using the original MTCNN architecture to implement face detection module 706, face detection module 706 can be implemented with a "thinner" version of the original MTCNN architecture to speed up the face-detection operation, based on a technique described in Howard et al., "MobileNets: Efficient Convolutional Networks for Mobile Vision Applications," CoRR, April 2017 (or "MobileNets" hereinafter).

Using the techniques of MobileNets, face detection module 706 can be implemented with a modified MTCNN architecture which uses a reduction parameter α to reduce the number of filters in each of the CONV layers. For a given CONV layer and a given α, the number of input channels M becomes αM and the number of output channels N becomes αN. In some embodiments, reduction parameter α has a value between 0 and 1, with typical settings of 0.25, 0.5, 0.75, and 1, wherein α=1 means no reduction. For example, when only 75% of the filters are used for all the CONV layers (i.e., α=0.75) in MobileNets, only a slight drop in classification accuracy was observed. Similarly, by keeping only 75% of the filters in all the CONV layers of the three networks of the MTCNN architecture, a “0.75 MTCNN” is obtained. For testing, the 0.75 MTCNN is implemented on an embedded platform using an ARM Cortex-A7 core, trained with the Wider Face dataset, and tested with the Face Detection Dataset and Benchmark (FDDB) with all R/G/B channels as inputs. The face inference results based on this 0.75 MTCNN architecture showed less than a 1% accuracy drop, while the inference time of the system is reduced by approximately one half. In practice, face detection module 706 based on a thinner MTCNN architecture with a reduction parameter α between 0.5 and 0.75 would achieve satisfactory face-detection accuracy with significantly faster processing speed.
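
As a concrete illustration of the reduction parameter α, the following minimal sketch (assuming PyTorch) thins a single CONV layer; the channel counts and the α value are examples only, and the helper function is not part of the described system.

    # Minimal sketch (assuming PyTorch): a CONV layer with M input and N
    # output channels is "thinned" to round(alpha*M) and round(alpha*N).
    import torch.nn as nn

    def thin_conv(in_channels, out_channels, kernel_size, alpha=0.75):
        reduced_in = max(1, int(round(alpha * in_channels)))
        reduced_out = max(1, int(round(alpha * out_channels)))
        return nn.Conv2d(reduced_in, reduced_out, kernel_size)

    # Example: a 3x3 layer with M=32, N=64 under alpha=0.75 becomes a
    # 24-in/48-out layer, with roughly (0.75)^2 of the original compute.
    layer = thin_conv(32, 64, kernel_size=3, alpha=0.75)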

Note that certain cameras in some embedded vision systems, such as some legacy surveillance systems, only capture grayscale video images 702. Moreover, video images 702 may be captured under monochrome and grayscale illumination conditions, so that captured video images 702 are closer to grayscale images than to color images. For example, night-time illumination in cities, such as bus-station light boxes and street lights, increasingly uses LED light sources, which are mostly monochrome and grayscale in nature. As another example, illumination for public transportation, such as subway stations and the interiors of subway trains, is also increasingly provided by LED lighting. In these situations, surveillance video images captured by cameras are also substantially grayscale images, even if the cameras are capable of capturing color images. Hence, when input video images 702 to face-detection-and-tracking subsystem 700 are grayscale images, image patches 714 supplied to face detection module 706 can be formatted as a single grayscale channel instead of three R/G/B channels.
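
For illustration, formatting an image patch as a single grayscale channel rather than three R/G/B channels might look like the following sketch, assuming OpenCV and NumPy; the file name is a placeholder.

    # Sketch: load a patch as one grayscale channel (H, W, 1) instead of
    # three color channels (H, W, 3). "patch.png" is a placeholder path.
    import cv2
    import numpy as np

    patch_gray = cv2.imread("patch.png", cv2.IMREAD_GRAYSCALE)   # shape (H, W)
    patch_gray = patch_gray[..., np.newaxis]                     # shape (H, W, 1)

    patch_color = cv2.imread("patch.png", cv2.IMREAD_COLOR)      # shape (H, W, 3)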

However, the performance of face detection module 706 can be degraded when face detection module 706 is trained using a regular training dataset while the input video images 702 are grayscale images. This is because public training datasets, such as the Wider Face dataset, are mostly composed of RGB images captured under natural illumination conditions. Hence, if the CNN within face detection module 706 is trained with RGB images while the input video images 702 are grayscale images, there can exist a significant amount of data distribution skew between the training dataset and the input images. As such, the trained CNN would have reduced accuracy when processing grayscale images 702. In a particular performance test, face detection module 706 is implemented with the original MTCNN architecture and trained with RGB images, while the grayscale input images are treated as three identical channels. In this particular test, the face detection performance is found to be sufficiently good. However, in another performance test, when face detection module 706 is implemented with the aforementioned 0.75 MTCNN, the 0.75 MTCNN trained on RGB images does not perform well on grayscale input images 702.

One solution to the above-described problem is to reduce the data distribution skew by making the training images and the processed video images more consistent with each other. In particular, when the input video images 702 are grayscale images, it is desirable to train face detection module 706 with grayscale images. However, when directly acquiring a large number of grayscale training images is not feasible, in some embodiments the training images from a large-scale training database, e.g., the Wider Face dataset, can first be converted into grayscale images, and these converted grayscale images are subsequently used to train face detection module 706. Note that the same concept can be applied to other situations involving particular types of input images. In other words, based on some determined characteristics of video images 702, the training dataset may be modified to reduce the data distribution skew by making the training images more consistent with, or similar to, video images 702.

FIG. 8 presents a flowchart illustrating an exemplary process for improving face detection performance when processing grayscale input images in accordance with some embodiments described herein. The process begins by receiving a training face-image dataset, e.g., the color FERET Database or the Wider Face dataset (step 802). The process next converts the color images in the training dataset into grayscale images (step 804). Next, the process trains the face detection CNN module using the converted grayscale images (step 806). Next, during a face-detection-and-tracking application, the process receives grayscale video images and performs face detection on the grayscale images using the face detection CNN module trained with grayscale training images (step 808).
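
Steps 802-804 amount to a one-time conversion pass over the training dataset. A minimal sketch, assuming OpenCV and placeholder directory names, is given below.

    # Sketch of steps 802-804: convert every color training image to
    # grayscale before training. Directory names are illustrative only.
    import os
    import cv2

    def convert_dataset_to_grayscale(src_dir, dst_dir):
        os.makedirs(dst_dir, exist_ok=True)
        for name in os.listdir(src_dir):
            image = cv2.imread(os.path.join(src_dir, name), cv2.IMREAD_COLOR)
            if image is None:
                continue  # skip files that are not readable images
            gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
            cv2.imwrite(os.path.join(dst_dir, name), gray)

    # convert_dataset_to_grayscale("wider_face/train", "wider_face_gray/train")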

Based on the above-described technique, the aforementioned performance test is modified. Specifically, the face detection module implemented with the 0.75 MTCNN architecture (or the “0.75 MTCNN module”) is trained with converted grayscale images of the Wider Face dataset, and the trained 0.75 MTCNN module is used to process images in the FDDB. It is observed that the performance of the grayscale-trained 0.75 MTCNN module is only slightly (~1.5%) worse than that of the RGB-trained full-size MTCNN module. However, the performance of the grayscale-trained 0.75 MTCNN module on grayscale input images is significantly better than that of the RGB-trained 0.75 MTCNN module on the same input images. Hence, the proposed thinner MTCNN architecture can be combined with the proposed CNN training technique to improve both the accuracy and the speed of face detection module 706 when processing grayscale input images.

As described above in conjunction with FIGS. 2-3, during a real-time face-tracking process, face-detection-and-tracking subsystem 700 keeps track of the best-pose face for each face tracker. When a given face tracker is determined to be lost and needs to be removed, the current best-pose face can be transmitted to the server for further analysis, e.g., face recognition. Oftentimes, the face tracker is considered lost when the face is temporarily blocked by another person or an object in the same image frame. When the blocked face is unblocked and reappears in a subsequent image frame, the face can be detected again. Ideally, the re-detected face should be recognized as an existing face rather than a new face. However, simple face-tracking techniques based on Kalman filtering and the Hungarian algorithm cannot handle such occlusion situations effectively, which often results in generating many near-duplicate faces. If such near-duplicate faces are not recognized by face-detection-and-tracking subsystem 700, duplicated best-pose faces would be sent to the server, which wastes both network bandwidth and storage space.

In some embodiments, to mitigate the above-described problem, a near-duplicate face detection procedure is performed prior to sending a best-pose face to the server when the associated face tracker is lost. More specifically, before sending each best-pose face to the server, a feature extraction operation is performed on that best-pose face to extract a predetermined image feature from the face image. For example, the predetermined feature can be a HoG feature, a Haar-like feature, a scale-invariant feature transform (SIFT) feature, or one of the DL-based face features. Moreover, in some embodiments, more than one type of image feature can be extracted from the best-pose face. For example, one implementation can extract both a HoG feature and a Haar-like feature from the best-pose face.
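
As one possible illustration of the feature-extraction step, the following sketch computes a HoG feature with OpenCV; the detection-window size and the resize are illustrative choices rather than parameters of the described system.

    # Sketch: extract a HoG feature vector from a best-pose face crop.
    # The default 64x128 HOG window and the resize are illustrative.
    import cv2
    import numpy as np

    _hog = cv2.HOGDescriptor()  # default 64x128 detection window

    def extract_hog_feature(face_crop_gray):
        # face_crop_gray: 8-bit grayscale face crop as a NumPy array (H, W)
        resized = cv2.resize(face_crop_gray, (64, 128))
        return _hog.compute(resized).flatten().astype(np.float32)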

Next, the extracted image feature is compared one by one with the stored features extracted from previously transmitted best-pose faces. Note that these stored features can be stored in a local buffer within the embedded vision system performing the face-tracking operation. However, the stored features can also be stored on a remote storage separate from the embedded vision system performing the face-tracking operation, e.g., on a cloud server. In some embodiments, for one-dimensional (1D) or two-dimensional (2D) features such as the HoG feature and the Haar-like feature, a cosine or Euclidean similarity can be computed between the newly extracted feature of the best-pose face and a stored feature of a previously transmitted best-pose face. Note that when DL-based face features are used, comparing the extracted feature with the stored features can detect a duplicated face from a stored face of the same person even when the stored face has a different pose from that of the duplicated face. In some embodiments, if any of the computed similarity values between the newly extracted feature and the stored features is above a predetermined threshold, e.g., 0.8, the best-pose face is determined to be a near-duplicate face and will not be sent to the server. However, if all computed similarity values between the newly extracted feature and the stored features are below the predetermined threshold, the best-pose face is considered to be associated with a unique face, which is then transmitted to the server, and the associated extracted feature is stored in the feature buffer or storage. Using the above-described technique, both the number of near-duplicate faces and the number of transmitted best-pose faces can be significantly reduced, thereby saving network bandwidth and storage resources.
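
The comparison against the feature buffer can be sketched as follows, using cosine similarity and the example threshold of 0.8; the function and variable names are illustrative.

    # Sketch of the near-duplicate check: compare the new feature with every
    # stored feature; any cosine similarity above the threshold marks the
    # best-pose face as a near-duplicate that should not be transmitted.
    import numpy as np

    def is_near_duplicate(new_feature, stored_features, threshold=0.8):
        for stored in stored_features:
            denom = np.linalg.norm(new_feature) * np.linalg.norm(stored) + 1e-12
            cos_sim = float(np.dot(new_feature, stored) / denom)
            if cos_sim > threshold:
                return True    # near-duplicate: skip transmission
        return False           # unique face: transmit and store its feature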

FIG. 9 presents a flowchart illustrating an exemplary process for identifying near-duplicate faces and selectively transmitting best-pose faces to the server in accordance with some embodiments described herein. The process begins by receiving, at the end of a face-tracking operation, a determined best-pose-face image associated with a tracked face when the corresponding face tracker is determined to be lost (step 902). In some embodiments, the tracked face is determined to be lost when the tracked face is blocked by an object for a predetermined number of image frames. The process next extracts an image feature from the best-pose-face image (step 904). As described above, the extracted feature can be a HoG feature, a Haar-like feature, a SIFT feature, a DL-based face feature, or a combination of the above. Next, the process computes a set of similarity values between the newly extracted feature and each of a set of stored features extracted from a set of previously transmitted best-pose-face images and stored in a feature buffer (step 906). Next, the process determines whether any of the computed similarity values is above a predetermined threshold (step 908). If the process determines that no computed similarity value is above the predetermined threshold, the process transmits the best-pose-face image to the server and stores the associated extracted feature in the feature buffer (step 910). Otherwise, the process determines that the best-pose-face image is a near-duplicate face and prevents the best-pose-face image from being transmitted to the server (step 912).

FIG. 10 illustrates an example client-server network environment which provides for implementing the disclosed embedded vision system in accordance with some embodiments described herein. A network environment 1000 includes a number of embedded vision systems 1002, 1004 and 1006 communicably connected to a server 1010 by a network 1008. One or more remote servers 1020 are further coupled to the server 1010 and/or the one or more embedded vision systems 1002, 1004 and 1006.

In some example embodiments, embedded vision systems 1002, 1004 and 1006 can include surveillance camera systems, machine vision systems, drones, robots, self-driving vehicles, smartphones, PDAs, portable media players, tablet computers, or other embedded systems integrated with one or more digital cameras. In one example, each of embedded vision systems 1002, 1004 and 1006 includes one or more cameras, a CPU, a DSP, and one or more small-scale CNN modules.

Server 1010 includes a processing device 1012 and a face database 1014. Processing device 1012 is configured to execute programs to perform face analysis on the face images received from embedded vision systems 1002, 1004 and 1006 based on the stored faces in face database 1014. Processing device 1012 is also configured to store processed face images into face database 1014.

In some example aspects, server 1010 can be a single computing device such as a computer server. In other embodiments, server 1010 can represent more than one computing device working together to perform the actions of a server computer (e.g., cloud computing). The server 1010 may host the web server communicably coupled to the browser at the client device (e.g., embedded vision systems 1002, 1004 and 1006) via network 1008. In one example, the server 1010 may host a client application for scheduling a customer-initiated service or a service-provider-initiated service between a service provider and a customer during a service scheduling process. Server 1010 may further be in communication with one or more remote servers 1020 either through the network 1008 or through another network or communication means.

The one or more remote servers 1020 may perform various functionalities and/or storage capabilities described herein with regard to the server 1010, either alone or in combination with server 1010. Each of the one or more remote servers 1020 may host various services. For example, servers 1020 may host services providing information regarding one or more suggested locations, such as web pages or websites associated with the suggested location, services for determining the location of one or more users or establishments, search engines for identifying results for a user query, one or more user review or query services, or one or more other services providing information regarding one or more establishments, customers, and/or reviews or feedback regarding the establishments.

Server 1010 may further maintain or be in communication with social networking services hosted on one or more remote servers 1020. The one or more social networking services may provide various services and may enable users to create a profile and associate themselves with other users at a remote social networking service. The server 1010 and/or the one or more remote servers 1020 may further facilitate the generation and maintenance of a social graph including the user-created associations. The social graphs may include, for example, a list of all users of the remote social networking service and their associations with other users of a remote social networking service.

Each of the one or more remote servers 1020 can be a single computing device such as a computer server or can represent more than one computing device working together to perform the actions of a server computer (e.g., cloud computing). In one embodiment, server 1010 and one or more remote servers 1020 may be implemented as a single server or across multiple servers. In one example, the server 1010 and one or more remote servers 1020 may communicate through the user agent at the client device (e.g., embedded vision systems 1002, 1004 and 1006) via network 1008.

Users of embedded vision systems 1002, 1004 and 1006 may interact with the system hosted by server 1010, and/or one or more services hosted by remote servers 1020, through a client application installed at embedded vision systems 1002, 1004 and 1006. Alternatively, the user may interact with the system and the one or more social networking services through a web-based browser application at embedded vision systems 1002, 1004 and 1006. Communication between embedded vision systems 1002, 1004 and 1006 and the system, and/or one or more services, may be facilitated through a network (e.g., network 1008).

Communications between embedded vision systems 1002, 1004 and 1006, server 1010 and/or one or more remote servers 1020 may be facilitated through various communication protocols. In some aspects, embedded vision systems 1002, 1004 and 1006, server 1010 and/or one or more remote servers 1020 may communicate wirelessly through a communication interface (not shown), which may include digital signal processing circuitry where necessary. The communication interface may provide for communications under various modes or protocols, including Global System for Mobile communication (GSM) voice calls, Short Message Service (SMS), Enhanced Messaging Service (EMS), or Multimedia Messaging Service (MMS) messaging, Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Personal Digital Cellular (PDC), Wideband Code Division Multiple Access (WCDMA), CDMA2000, or General Packet Radio Service (GPRS), among others. For example, the communication may occur through a radio-frequency transceiver (not shown). In addition, short-range communication may occur, including via a Bluetooth, WiFi, or other such transceiver.

Network 1008 can include, for example, any one or more of a personal area network (PAN), a local area network (LAN), a campus area network (CAN), a metropolitan area network (MAN), a wide area network (WAN), a broadband network (BBN), the Internet, and the like. Further, the network 1008 can include, but is not limited to, any one or more of the following network topologies, including a bus network, a star network, a ring network, a mesh network, a star-bus network, a tree or hierarchical network, and the like.

The various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.

The hardware used to implement the various illustrative logics, logical blocks, modules, and circuits described in connection with the aspects disclosed herein may be implemented or performed with a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but, in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Alternatively, some steps or methods may be performed by circuitry that is specific to a given function.

In one or more exemplary aspects, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored as one or more instructions or code on a non-transitory computer-readable storage medium or non-transitory processor-readable storage medium. The steps of a method or algorithm disclosed herein may be embodied in processor-executable instructions that may reside on a non-transitory computer-readable or processor-readable storage medium. Non-transitory computer-readable or processor-readable storage media may be any storage media that may be accessed by a computer or a processor. By way of example but not limitation, such non-transitory computer-readable or processor-readable storage media may include RAM, ROM, EEPROM, FLASH memory, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that may be used to store desired program code in the form of instructions or data structures and that may be accessed by a computer. Disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above are also included within the scope of non-transitory computer-readable and processor-readable media. Additionally, the operations of a method or algorithm may reside as one or any combination or set of codes and/or instructions on a non-transitory processor-readable storage medium and/or computer-readable storage medium, which may be incorporated into a computer program product.

While this patent document contains many specifics, these should not be construed as limitations on the scope of any disclosed technology or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular techniques. Certain features that are described in this patent document in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Moreover, the separation of various system components in the embodiments described in this patent document should not be understood as requiring such separation in all embodiments.

Only a few implementations and examples are described, and other implementations, enhancements and variations can be made based on what is described and illustrated in this patent document.

What is claimed is:
1. A computer-implemented method for performing image detections, the method comprising: receiving a training image dataset, wherein the training image dataset includes a first subset of color images; converting each image in the first subset of color images in the training image dataset into a grayscale image to obtain a first subset of converted grayscale images; training an image-detection statistical model using the training image dataset including the first subset of converted grayscale images; receiving a set of grayscale input images; and performing image detections on the set of grayscale input images using the trained image-detection statistical model.
2. The computer-implemented method of claim 1, wherein the training image dataset is a large-scale public training dataset composed of primarily color images.
3. The computer-implemented method of claim 1, wherein the set of grayscale input images are captured under a monochrome or a grayscale illumination condition.
4. The computer-implemented method of claim 3, wherein the monochrome or grayscale illumination condition includes LED lighting.
5. The computer-implemented method of claim 1, wherein the set of grayscale input images are captured by a camera configured to capture only grayscale images.
6. The computer-implemented method of claim 1, wherein the training image dataset further includes a second subset of grayscale images, and wherein training the image-detection statistical model using the training image dataset includes using both the first subset of converted grayscale images and the second subset of grayscale images.
7. The computer-implemented method of claim 1, wherein the training image dataset is a face-image training dataset, and wherein the image-detection statistical model is a face-detection statistical model.
8. The computer-implemented method of claim 7, wherein the face-detection statistical model includes a convolutional-neural-network (CNN) face-detection module, and wherein the CNN face-detection module further includes a multitask-cascaded CNN (MTCNN).
9. The computer-implemented method of claim 1, wherein converting the training image dataset into grayscale images reduces data distribution skews between the training image dataset and the set of grayscale input images.
10. The computer-implemented method of claim 1, wherein performing image detections on the set of grayscale input images using the image-detection statistical model trained on the grayscale training images improves image detection accuracy over using an image-detection statistical model trained on color training images.
11. An apparatus for performing image detections, comprising: one or more processors; and a memory coupled to the one or more processors, wherein the memory stores instructions that, when executed by the one or more processors, cause the apparatus to: receive a training image dataset, wherein the training image dataset includes a first subset of color images; convert each image in the first subset of color images in the training image dataset into a grayscale image to obtain a first subset of converted grayscale images; train an image-detection statistical model using the training image dataset including the first subset of converted grayscale images; receive a set of grayscale input images; and perform image detections on the set of grayscale input images using the trained image-detection statistical model.
12. The apparatus of claim 11, wherein the training image dataset is a large-scale public training dataset composed of primarily color images.
13. The apparatus of claim 11, wherein the set of grayscale input images are captured under a monochrome or a grayscale illumination condition.
14. The apparatus of claim 13, wherein the monochrome or grayscale illumination condition includes LED lighting.
15. The apparatus of claim 11, wherein the set of grayscale input images are captured by a camera configured to capture only grayscale images.
16. The apparatus of claim 11, wherein the training image dataset further includes a second subset of grayscale images, and wherein training the image-detection statistical model using the training image dataset includes using both the first subset of converted grayscale images and the second subset of grayscale images.
17. The apparatus of claim 11, wherein the training image dataset is a face-image training dataset, and wherein the image-detection statistical model is a face-detection statistical model.
18. The apparatus of claim 17, wherein the face-detection statistical model includes a convolutional-neural-network (CNN) face-detection module, and wherein the CNN face-detection module further includes a multitask-cascaded CNN (MTCNN).
19. The apparatus of claim 14, wherein performing image detections on the set of grayscale input images using the image-detection statistical model trained on the grayscale training images improves image detection accuracy over using an image-detection statistical model trained on color training images.
20. A system for performing image detections, comprising: a machine learning module comprising a non-transitory computer-readable medium comprising instructions that, when executed by one or more processors, cause the system to: receive a training image dataset, wherein the training image dataset includes a first subset of color images; convert each image in the first subset of color images in the training image dataset into a grayscale image to obtain a first subset of converted grayscale images; train an image-detection statistical model using the training image dataset including the first subset of converted grayscale images; receive a set of grayscale input images; and perform image detections on the set of grayscale input images using the trained image-detection statistical model.